AITopics | acoustic scene

Collaborating Authors

acoustic scene

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Sci-Phi: A Large Language Model Spatial Audio Descriptor

Jiang, Xilin, Gamper, Hannes, Braun, Sebastian

arXiv.org Artificial IntelligenceOct-8-2025

Acoustic scene perception involves describing the type of sounds, their timing, their direction and distance, as well as their loudness and reverberation. While audio language models excel in sound recognition, single-channel input fundamentally limits spatial understanding. This work presents Sci-Phi, a spatial audio large language model with dual spatial and spectral encoders that estimates a complete parameter set for all sound sources and the surrounding environment. Learning from over 4,000 hours of synthetic first-order Ambisonics recordings including metadata, Sci-Phi enumerates and describes up to four directional sound sources in one pass, alongside non-directional background sounds and room characteristics. We evaluate the model with a permutation-invariant protocol and 15 metrics covering content, location, timing, loudness, and reverberation, and analyze its robustness across source counts, signal-to-noise ratios, reverberation levels, and challenging mixtures of acoustically, spatially, or temporally similar sources. Notably, Sci-Phi generalizes to real room impulse responses with only minor performance degradation. Overall, this work establishes the first audio LLM capable of full spatial-scene description, with strong potential for real-world deployment. Demo: https://sci-phi-audio.github.io/demo

large language model, natural language, sci-phi, (19 more...)

arXiv.org Artificial Intelligence

2510.05542

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

AISTAT lab system for DCASE2025 Task6: Language-based audio retrieval

Kim, Hyun Jun, Choi, Hyeong Yong, Lim, Changwon

arXiv.org Artificial IntelligenceSep-23-2025

ABSTRACT This report presents the AIST A T team's submission to the lan guage-based audio retrieval task in DCASE 2025 Task 6. Our proposed system employs dual encoder architecture, where audi o and text modalities are encoded separately, and their repre senta-tions are aligned using contrastive learning. Additionally, we incorporat ed clustering to introduce an auxiliary classification task for fur ther fine-tuning. Our best single system achieved a mAP@16 of 46.62, wh ile an ensem-ble of four systems reached a mAP@16 of 48.83 on the Clotho development test split. Index T erms -- Audio-text retrieval, contrastive learning, knowledge distillation, topic modeling 1. INTRODUCTION DCASE 2025 Task 6 challenge [1] focuses on language-based au - dio retrieval, a task that requires retrieving audio record ings from a database that best matches a given textual query, and vice v ersa.

caption, large language model, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2509.16649

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.49)

Add feedback

IS${}^3$ : Generic Impulsive--Stationary Sound Separation in Acoustic Scenes using Deep Filtering

Berger, Clémentine, Stamatiadis, Paraskevas, Badeau, Roland, Essid, Slim

arXiv.org Artificial IntelligenceSep-15-2025

We are interested in audio systems capable of performing a differentiated processing of stationary backgrounds and isolated acoustic events within an acoustic scene, whether for applying specific processing methods to each part or for focusing solely on one while ignoring the other. Such systems have applications in real-world scenarios, including robust adaptive audio rendering systems (e.g., EQ or compression), plosive attenuation in voice mixing, noise suppression or reduction, robust acoustic event classification or even bioacoustics. To this end, we introduce IS${}^3$, a neural network designed for Impulsive--Stationary Sound Separation, that isolates impulsive acoustic events from the stationary background using a deep filtering approach, that can act as a pre-processing stage for the above-mentioned tasks. To ensure optimal training, we propose a sophisticated data generation pipeline that curates and adapts existing datasets for this task. We demonstrate that a learning-based approach, build on a relatively lightweight neural architecture and trained with well-designed and varied data, is successful in this previously unaddressed task, outperforming the Harmonic--Percussive Sound Separation masking method, adapted from music signal processing research, and wavelet filtering on objective separation metrics.

artificial intelligence, dataset, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2509.02622

Country: Asia > Japan (0.28)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Performance improvement of spatial semantic segmentation with enriched audio features and agent-based error correction for DCASE 2025 Challenge Task 4

Park, Jongyeon, Lee, Joonhee, Lim, Do-Hyeon, Kim, Hong Kook, Geum, Hyeongcheol, Lim, Jeong Eun

arXiv.org Artificial IntelligenceJun-27-2025

This technical report presents submission systems for Task 4 of the DCASE 2025 Challenge. This model incorporates additional audio features (spectral roll-off and chroma features) into the embedding feature extracted from the mel-spectral feature to im-prove the classification capabilities of an audio-tagging model in the spatial semantic segmentation of sound scenes (S5) system. This approach is motivated by the fact that mixed audio often contains subtle cues that are difficult to capture with mel-spectrograms alone. Thus, these additional features offer alterna-tive perspectives for the model. Second, an agent-based label correction system is applied to the outputs processed by the S5 system. This system reduces false positives, improving the final class-aware signal-to-distortion ratio improvement (CA-SDRi) metric. Finally, we refine the training dataset to enhance the classi-fication accuracy of low-performing classes by removing irrele-vant samples and incorporating external data. That is, audio mix-tures are generated from a limited number of data points; thus, even a small number of out-of-class data points could degrade model performance. The experiments demonstrate that the submit-ted systems employing these approaches relatively improve CA-SDRi by up to 14.7% compared to the baseline of DCASE 2025 Challenge Task 4.

artificial intelligence, machine learning, spectral roll, (14 more...)

arXiv.org Artificial Intelligence

2506.21174

Genre: Research Report (0.65)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.89)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.62)

Add feedback

Leveraging Reverberation and Visual Depth Cues for Sound Event Localization and Detection with Distance Estimation

Berghi, Davide, Jackson, Philip J. B.

arXiv.org Artificial IntelligenceOct-29-2024

This report describes our systems submitted for the DCASE2024 Task 3 challenge: Audio and Audiovisual Sound Event Localization and Detection with Source Distance Estimation (Track B). Our main model is based on the audio-visual (AV) Conformer, which processes video and audio embeddings extracted with ResNet50 and with an audio encoder pre-trained on SELD, respectively. This model outperformed the audio-visual baseline of the development set of the STARSS23 dataset by a wide margin, halving its DOAE and improving the F1 by more than 3x. Our second system performs a temporal ensemble from the outputs of the AV-Conformer. We then extended the model with features for distance estimation, such as direct and reverberant signal components extracted from the omnidirectional audio channel, and depth maps extracted from the video frames. While the new system improved the RDE of our previous model by about 3 percentage points, it achieved a lower F1 score. This may be caused by sound classes that rarely appear in the training set and that the more complex system does not detect, as analysis can determine. To overcome this problem, our fourth and final system consists of an ensemble strategy combining the predictions of the other three. Many opportunities to refine the system and training strategy can be tested in future ablation experiments, and likely achieve incremental performance gains for this audio-visual task.

artificial intelligence, detection and classification, machine learning, (12 more...)

arXiv.org Artificial Intelligence

2410.22271

Country: Europe > United Kingdom > England > Surrey > Guildford (0.05)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

Add feedback

Estimated Audio-Caption Correspondences Improve Language-Based Audio Retrieval

Primus, Paul, Schmid, Florian, Widmer, Gerhard

arXiv.org Artificial IntelligenceAug-21-2024

Dual-encoder-based audio retrieval systems are commonly optimized with contrastive learning on a set of matching and mismatching audio-caption pairs. This leads to a shared embedding space in which corresponding items from the two modalities end up close together. Since audio-caption datasets typically only contain matching pairs of recordings and descriptions, it has become common practice to create mismatching pairs by pairing the audio with a caption randomly drawn from the dataset. This is not ideal because the randomly sampled caption could, just by chance, partly or entirely describe the audio recording. However, correspondence information for all possible pairs is costly to annotate and thus typically unavailable; we, therefore, suggest substituting it with estimated correspondences. To this end, we propose a two-staged training procedure in which multiple retrieval models are first trained as usual, i.e., without estimated correspondences. In the second stage, the audio-caption correspondences predicted by these models then serve as prediction targets. We evaluate our method on the ClothoV2 and the AudioCaps benchmark and show that it improves retrieval performance, even in a restricting self-distillation setting where a single model generates and then learns from the estimated correspondences. We further show that our method outperforms the current state of the art by 1.6 pp. mAP@10 on the ClothoV2 benchmark.

caption, correspondence, dataset, (14 more...)

arXiv.org Artificial Intelligence

2408.11641

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.16)
Europe > France (0.05)
Europe > Finland > Uusimaa > Helsinki (0.05)
(10 more...)

Genre: Research Report (0.50)

Industry: Media (0.36)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Add feedback

Description and Discussion on DCASE 2024 Challenge Task 2: First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring

Nishida, Tomoya, Harada, Noboru, Niizumi, Daisuke, Albertini, Davide, Sannino, Roberto, Pradolini, Simone, Augusti, Filippo, Imoto, Keisuke, Dohi, Kota, Purohit, Harsh, Endo, Takashi, Kawaguchi, Yohei

arXiv.org Artificial IntelligenceJun-11-2024

We present the task description of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 Challenge Task 2: First-shot unsupervised anomalous sound detection (ASD) for machine condition monitoring. Continuing from last year's DCASE 2023 Challenge Task 2, we organize the task as a first-shot problem under domain generalization required settings. The main goal of the first-shot problem is to enable rapid deployment of ASD systems for new kinds of machines without the need for machine-specific hyperparameter tunings. This problem setting was realized by (1) giving only one section for each machine type and (2) having completely different machine types for the development and evaluation datasets. For the DCASE 2024 Challenge Task 2, data of completely new machine types were newly collected and provided as the evaluation dataset. In addition, attribute information such as the machine operation conditions were concealed for several machine types to mimic situations where such information are unavailable. We will add challenge results and analysis of the submissions after the challenge submission deadline.

dataset, machine type, task 2, (16 more...)

arXiv.org Artificial Intelligence

2406.0725

Country:

North America > United States (0.05)
Asia > Japan (0.05)
Europe > Italy (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
Information Technology > Data Science > Data Mining > Anomaly Detection (0.31)

Add feedback

Why do Angular Margin Losses work well for Semi-Supervised Anomalous Sound Detection?

Wilkinghoff, Kevin, Kurth, Frank

arXiv.org Artificial IntelligenceNov-24-2023

State-of-the-art anomalous sound detection systems often utilize angular margin losses to learn suitable representations of acoustic data using an auxiliary task, which usually is a supervised or self-supervised classification task. The underlying idea is that, in order to solve this auxiliary task, specific information about normal data needs to be captured in the learned representations and that this information is also sufficient to differentiate between normal and anomalous samples. Especially in noisy conditions, discriminative models based on angular margin losses tend to significantly outperform systems based on generative or one-class models. The goal of this work is to investigate why using angular margin losses with auxiliary tasks works well for detecting anomalous sounds. To this end, it is shown, both theoretically and experimentally, that minimizing angular margin losses also minimizes compactness loss while inherently preventing learning trivial solutions. Furthermore, multiple experiments are conducted to show that using a related classification task as an auxiliary task teaches the model to learn representations suitable for detecting anomalous sounds in noisy conditions. Among these experiments are performance evaluations, visualizing the embedding space with t-SNE and visualizing the input representations with respect to the anomaly score using randomized input sampling for explanation.

compactness loss, ic compactness loss, machine type, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/TASLP.2023.3337153

2309.15643

Country:

Europe > Finland > Pirkanmaa > Tampere (0.05)
North America > United States (0.04)
Europe > Germany > North Rhine-Westphalia > Münster Region > Münster (0.04)
Europe > Germany > North Rhine-Westphalia > Cologne Region > Bonn (0.04)

Genre: Research Report (0.50)

Industry:

Health & Medicine (0.67)
Media (0.46)
Leisure & Entertainment (0.46)
Information Technology (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Add feedback

FALL-E: A Foley Sound Synthesis Model and Strategies

Kang, Minsung, Oh, Sangshin, Moon, Hyeongi, Lee, Kyungyun, Chon, Ben Sangbae

arXiv.org Artificial IntelligenceAug-10-2023

This paper introduces FALL-E, a foley synthesis system and its training/inference strategies. The FALL-E model employs a cascaded approach comprising low-resolution spectrogram generation, spectrogram super-resolution, and a vocoder. We trained every sound-related model from scratch using our extensive datasets, and utilized a pre-trained language model. We conditioned the model with dataset-specific texts, enabling it to learn sound quality and recording environment based on text input. Moreover, we leveraged external language models to improve text descriptions of our datasets and performed prompt engineering for quality, coherence, and diversity. FALL-E was evaluated by an objective measure as well as listening tests in the DCASE 2023 challenge Task 7. The submission achieved the second place on average, while achieving the best score for diversity, second place for audio quality, and third place for class fitness.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2306.09807

Country:

Europe > Finland > Pirkanmaa > Tampere (0.05)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia > South Korea > Seoul > Seoul (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)

Add feedback

Efficient Audio Captioning Transformer with Patchout and Text Guidance

Kouzelis, Thodoris, Bastas, Grigoris, Katsamanis, Athanasios, Potamianos, Alexandros

arXiv.org Artificial IntelligenceApr-6-2023

Automated audio captioning is multi-modal translation task that aim to generate textual descriptions for a given audio clip. In this paper we propose a full Transformer architecture that utilizes Patchout as proposed in [1], significantly reducing the computational complexity and avoiding overfitting. The caption generation is partly conditioned on textual AudioSet tags extracted by a pre-trained classification model which is fine-tuned to maximize the semantic similarity between AudioSet labels and ground truth captions. To mitigate the data scarcity problem of Automated Audio Captioning we introduce transfer learning from an upstream audio-related task and an enlarged in-domain dataset. Moreover, we propose a method to apply Mixup augmentation for AAC. Ablation studies are carried out to investigate how Patchout and text guidance contribute to the final performance. The results show that the proposed techniques improve the performance of our system and while reducing the computational complexity. Our proposed method received the Judges Award at the Task6A of DCASE Challenge 2022.

caption, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2304.02916

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Greece > Attica > Athens (0.05)
Europe > Spain (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback